model quality
- North America > United States > Texas > Harris County > Houston (0.04)
- South America > Chile > Santiago Metropolitan Region > Santiago Province > Santiago (0.04)
- North America > United States > New Jersey > Hudson County > Hoboken (0.04)
- North America > United States > Louisiana > Orleans Parish > New Orleans (0.04)
- Asia > China > Hubei Province > Wuhan (0.04)
- North America > Canada (0.04)
- (8 more...)
- Information Technology > Artificial Intelligence > Machine Learning > Learning Graphical Models > Directed Networks > Bayesian Learning (0.92)
- Information Technology > Communications > Social Media > Crowdsourcing (0.87)
- Information Technology > Data Science (0.67)
- Information Technology > Artificial Intelligence > Representation & Reasoning (0.67)
Fine-tuning Language Models over Slow Networks using Activation Quantization with Guarantees
Communication compression is a crucial technique for modern distributed learning systems to alleviate their communication bottlenecks over slower networks. Despite recent intensive studies of gradient compression for data parallel-style training, compressing the activations for models trained with pipeline parallelism is still an open problem. In this paper, we propose AQ-SGD, a novel activation compression algorithm for communication-efficient pipeline parallelism training over slow networks.
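The key idea behind AQ-SGD (per the full paper) is to quantize not the activation itself, whose error would compound across pipeline stages and training steps, but the change in a sample's activation since it was last communicated. Below is a minimal sketch of that delta scheme; the class and function names, the uniform quantizer, and the per-sample cache are illustrative assumptions, not the authors' implementation.

```python
import torch

def uniform_quantize(x: torch.Tensor, num_bits: int = 4) -> torch.Tensor:
    """Simple symmetric uniform quantizer (illustrative stand-in)."""
    scale = x.abs().max().clamp(min=1e-8)
    levels = 2 ** (num_bits - 1) - 1
    return torch.round(x / scale * levels) / levels * scale

class DeltaActivationCompressor:
    """Caches the last communicated activation per training sample and
    sends only a quantized delta over the slow inter-stage link."""

    def __init__(self, num_bits: int = 4):
        self.num_bits = num_bits
        self.cache: dict[int, torch.Tensor] = {}

    def compress(self, sample_id: int, activation: torch.Tensor) -> torch.Tensor:
        prev = self.cache.get(sample_id)
        if prev is None:
            # First visit: send the quantized value itself.
            msg = uniform_quantize(activation, self.num_bits)
            self.cache[sample_id] = msg
        else:
            # Later visits: send only the quantized change.
            msg = uniform_quantize(activation - prev, self.num_bits)
            self.cache[sample_id] = prev + msg
        # The receiver keeps an identical cache and reconstructs prev + msg.
        return msg
```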
Matryoshka Model Learning for Improved Elastic Student Models
Chetan Verma, Aditya Srinivas Timmaraju, Cho-Jui Hsieh, Suyash Damle, Ngot Bui, Yang Zhang, Wen Chen, Xin Liu, Prateek Jain, Inderjit S. Dhillon
Industry-grade ML models are carefully designed to meet rapidly evolving serving constraints, which requires significant resources for model development. In this paper, we propose MatTA, a framework for training multiple accurate Student models using a novel Teacher-TA-Student recipe. TA models are larger versions of the Student models with higher capacity, and thus allow Student models to better relate to the Teacher model and also bring in more domain-specific expertise. Furthermore, multiple accurate Student models can be extracted from the TA model. Therefore, despite requiring only one training run, our methodology provides multiple servable options to trade off accuracy for lower serving cost. We demonstrate the proposed method, MatTA, on proprietary datasets and models. Its practical efficacy is underscored by live A/B tests within a production ML system, demonstrating a 20% improvement on a key metric. We also demonstrate our method on GPT-2 Medium, a public model, and achieve relative improvements of over 24% on SAT Math and over 10% on the LAMBADA benchmark.
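As a rough illustration of the recipe the abstract describes, here is a hedged PyTorch sketch in which the TA is a wider version of the Student, every nested width is trained against the Teacher, and each width is a servable Student afterward. The two-layer ElasticTA, the width fractions, and the KL distillation loss are illustrative assumptions, not the proprietary MatTA setup.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ElasticTA(nn.Module):
    """TA whose hidden layer can be sliced at serve time; smaller
    Students reuse the leading hidden units of the TA."""

    def __init__(self, d_in: int = 128, d_hidden: int = 512, n_classes: int = 10):
        super().__init__()
        self.fc1 = nn.Linear(d_in, d_hidden)
        self.fc2 = nn.Linear(d_hidden, n_classes)

    def forward(self, x: torch.Tensor, width: float = 1.0) -> torch.Tensor:
        # Slice the leading `width` fraction of hidden units; the output
        # layer keeps its full shape so logits stay comparable.
        k = max(1, int(self.fc1.out_features * width))
        h = F.relu(F.linear(x, self.fc1.weight[:k], self.fc1.bias[:k]))
        return F.linear(h, self.fc2.weight[:, :k], self.fc2.bias)

def distill_step(teacher_logits, ta, x, widths=(1.0, 0.5, 0.25), tau=2.0):
    """Train every nested width against the Teacher in one pass, so a
    single run yields several accuracy/cost operating points."""
    loss = 0.0
    for w in widths:
        student_logits = ta(x, width=w)
        loss = loss + F.kl_div(
            F.log_softmax(student_logits / tau, dim=-1),
            F.softmax(teacher_logits / tau, dim=-1),
            reduction="batchmean",
        ) * (tau * tau)
    return loss / len(widths)
```

After training, calling `ta(x, width=0.25)` by itself acts as the smallest extracted Student: its forward pass touches only the leading quarter of the hidden layer.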
- North America > United States > California > Los Angeles County > Los Angeles (0.14)
- North America > Canada > Ontario > Toronto (0.05)
- North America > United States > California > Santa Clara County > Mountain View (0.05)
- (3 more...)
- Education > Educational Technology > Educational Software (1.00)
- Education > Educational Setting (0.94)
D-com: Accelerating Iterative Processing to Enable Low-rank Decomposition of Activations
Faraz Tahmasebi, Michael Pelluer, Hyoukjun Kwon
The computation and memory costs of large language models have kept increasing over the last decade, now exceeding the scale of 1T parameters. To address the challenges of such large-scale models, model compression techniques such as low-rank decomposition have been explored. Previous model decomposition works have focused on weight decomposition to avoid costly runtime decomposition, whose latency often significantly exceeds the benefits of decomposition (e.g., 38% more end-to-end latency when running Llama2-7b on an A100 with 4K sequence length with activation decomposition compared to no decomposition). In this work, we debunk such observations and report that input decomposition can be significantly beneficial with a proper choice of decomposition algorithm and hardware support. We adopt a progressive decomposition algorithm, the Lanczos algorithm, and design a co-accelerator architecture for it. To address the memory-boundness of the decomposition operation, we introduce a novel compute replication methodology that moves the operation toward the compute-bound region, which enables a 6.2x speedup in our evaluation. We also develop an output-shape-preserving computation scheme that eliminates decomposition costs in consecutive layers. To compensate for the model quality loss from compression, we introduce a multi-track decomposition approach that separately handles outlier channels for high accuracy and low perplexity with minimal computational costs. Combined, our accelerator, D-com, provides a 22% end-to-end latency improvement over an A100 GPU at the cost of small model quality degradation (e.g., 3% on the AI2 Reasoning Challenge task).
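To make the decomposition side of this concrete, the sketch below shows (a) a progressive rank-k approximation of an activation matrix, using plain subspace (power) iteration as a simple software stand-in for the Lanczos algorithm the paper accelerates in hardware, and (b) a multi-track split that keeps a few high-magnitude outlier channels exact. The function names, outlier fraction, and iteration count are illustrative assumptions.

```python
import numpy as np

def progressive_low_rank(x: np.ndarray, rank: int, iters: int = 4):
    """Return U (n x r) and W (r x d) with x ~= U @ W."""
    n, d = x.shape
    v = np.linalg.qr(np.random.randn(d, rank))[0]
    for _ in range(iters):                  # each pass refines the subspace
        u = np.linalg.qr(x @ v)[0]          # (n, rank) orthonormal basis
        v = np.linalg.qr(x.T @ u)[0]        # (d, rank) orthonormal basis
    u = np.linalg.qr(x @ v)[0]
    return u, u.T @ x

def multi_track_decompose(x: np.ndarray, rank: int, outlier_frac: float = 0.02):
    """Low-rank the bulk of the channels; keep outlier channels exact."""
    k = max(1, int(outlier_frac * x.shape[1]))
    outliers = np.argsort(np.abs(x).max(axis=0))[-k:]       # exact track
    rest = np.setdiff1d(np.arange(x.shape[1]), outliers)    # low-rank track
    u, w = progressive_low_rank(x[:, rest], rank)
    return u, w, outliers, x[:, outliers]
```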
- North America > United States > California > Orange County > Irvine (0.14)
- Europe > Austria > Vienna (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- (2 more...)
- North America > United States > California (0.28)
- Asia > China > Hubei Province (0.14)
- Oceania > Australia (0.14)
- (2 more...)
- Asia > Middle East > Jordan (0.04)
GUIDE: Guided Initialization and Distillation of Embeddings
Khoa Trinh, Gaurav Menghani, Erik Vee
Algorithmic efficiency techniques such as distillation (Hinton et al., 2015) are useful in improving model quality without increasing serving costs, provided a larger teacher model is available for a smaller student model to learn from during training. Standard distillation methods are limited to only forcing the student to match the teacher's outputs. Given the costs associated with training a large model, we believe we should be extracting more useful information from a teacher model than just making the student match the teacher's outputs. In this paper, we introduce GUIDE (Guided Initialization and Distillation of Embeddings). GUIDE can be considered a distillation technique that forces the student to match the teacher in the parameter space. Using GUIDE, we show a 25-26% reduction in the teacher-student quality gap when using large student models (400M-1B parameters) trained on approximately 20B tokens. We also present a thorough analysis demonstrating that GUIDE can be combined with knowledge distillation with near-additive improvements. Furthermore, we show that applying GUIDE alone leads to substantially better model quality than applying knowledge distillation by itself. Most importantly, GUIDE introduces no training or inference overhead, and hence any model quality gains from our method are virtually free.
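Since GUIDE claims zero training and inference overhead, the parameter-space matching is plausibly an initialization-time transfer of teacher parameters into the student. The sketch below shows one such guided initialization for embedding tables; truncating to the student's leading dimensions and assuming a shared vocabulary are illustrative choices, not necessarily the paper's exact mapping.

```python
import torch
import torch.nn as nn

def guided_init(student_emb: nn.Embedding, teacher_emb: nn.Embedding) -> None:
    """Copy the leading columns of the teacher's embedding table into
    the student (requires matching vocabulary sizes)."""
    assert student_emb.num_embeddings == teacher_emb.num_embeddings
    d_s = student_emb.embedding_dim
    with torch.no_grad():
        student_emb.weight.copy_(teacher_emb.weight[:, :d_s])

# Example: a 1024-dim teacher guiding a 256-dim student over a shared vocab.
teacher = nn.Embedding(32000, 1024)
student = nn.Embedding(32000, 256)
guided_init(student, teacher)
```

Because the transfer happens once before training, ordinary task or distillation training proceeds unchanged afterward, consistent with the "virtually free" framing above.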
- South America > Colombia > Meta Department > Villavicencio (0.04)
- North America > United States > California > Santa Clara County > Mountain View (0.04)
- Europe > Ireland > Leinster > County Dublin > Dublin (0.04)